Today we are going to continue putting it together in Module 4. Today’s material is on making Manhattan plots, which is a commonly used plot type for visualizing the result of genome wide association studies (GWAS). The name comes from its resemblance to the skyscrapers in Manhattan, poking above the background of the rest of the buildings.
Figure from Wikipedia
The plot visualizes the relationship between a trait and genetic markers. The x-axis shows the position on each chromosome, and the y-axis shows the negative log (usually log10) p-value of the quantitative response of a trait to that specific marker. Negative log10 p-value is used because a significant p-value is always small, and this transformation converts low p-value to a number that can be seen easily among the background of non-significant associations.
If you work in genetics/genomics, it is likely you will create Manhattan plots. Even if you think you’ll never make one of these types of plots, its a useful activity to see additional ways of customizing your plots.
library(tidyverse) # everything
library(glue) # easy pasting
library(ggrepel) # repelling labels
Today we are going to continue to use some different real research data collected by Emma Bilbrey from my team where we conducted many GWAS in apple. This work was published in 2021 in New Phytologist and can be found here. This data is more complex than a typical GWAS so we are only going to use a small portion of it.
We will be reading in Table S16 which includes the -log10 p-values for the GWAS conducted across all apples for all features found in the LC-MS negative ionization mode metabolomics dataset.
The data is present in a .csv file, so we will use the function
read_csv() from the tidyverse. We want to import
Supplemental Table 16. You can indicate which sheet you want to import
in the arguments to read_excel().
This will take a second, its a big file.
gwas <- read_csv("data/nph17693-sup-0007-tables16.csv") # be patient
## Rows: 11165 Columns: 4704
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4704): Index, Linkage_Group, Genetic_Distance, X885.2037_2.98177, X525....
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What are the dimensions of this dataframe? What kind of object is it?
dim(gwas)
## [1] 11165 4704
class(gwas)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Because this dataframe is so big, if we use head(gwas)
we will get a print out of the first 6 rows, and all the columns. In thi
case there are 4704 columns so that will be unwieldy.
Emma came up with a simple way to approach this when she was writing her code, she wrote herself a little function that she could use regularly to extract out the first 5 rows, and the first 5 columns, without having to index each time.
If we wanted to just see the first 5 rows, the first 5 columsn we could do this:
gwas[1:5,1:5]
## # A tibble: 5 × 5
## Index Linkage_Group Genetic_Distance X885.2037_2.98177 X525.1583_3.24969
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0 0.176 0.117
## 2 3 1 0.002 0.0978 0.0166
## 3 4 1 0.003 0.169 0.0519
## 4 5 1 0.004 0.217 0.309
## 5 6 1 0.005 0.0548 0.110
head_short <- function(x){
x[1:5,1:5] # this function shows the first 5 rows and columns of an object
}
Now instead of indexing all the time, we can just run
head_short() which I think is easier. We will talk a little
bit more about writing functions in the class on making many plots at
once.
head_short(gwas)
## # A tibble: 5 × 5
## Index Linkage_Group Genetic_Distance X885.2037_2.98177 X525.1583_3.24969
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0 0.176 0.117
## 2 3 1 0.002 0.0978 0.0166
## 3 4 1 0.003 0.169 0.0519
## 4 5 1 0.004 0.217 0.309
## 5 6 1 0.005 0.0548 0.110
How many markers are included here?
nrow(gwas)
## [1] 11165
How many linkage groups do we have? (Each linkage group is a chromosome.)
unique(gwas$Linkage_Group)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
What is the range of Genetic_Distance for each
chromosome?
gwas %>%
group_by(Linkage_Group) %>%
summarize(min_genetic_distance = min(Genetic_Distance),
max_genetic_distance = max(Genetic_Distance))
## # A tibble: 17 × 3
## Linkage_Group min_genetic_distance max_genetic_distance
## <dbl> <dbl> <dbl>
## 1 1 0 63.1
## 2 2 0 78.4
## 3 3 0 74.0
## 4 4 0.004 65.5
## 5 5 0 77.8
## 6 6 0 75.3
## 7 7 0.001 82.4
## 8 8 0 68.5
## 9 9 0.292 67.0
## 10 10 0 81.3
## 11 11 0 80.9
## 12 12 0.382 65.4
## 13 13 0 71.4
## 14 14 0 64.4
## 15 15 0 112.
## 16 16 0.001 67.5
## 17 17 0 71.8
How are the Index distributed across
Linkage_Group?
gwas %>%
group_by(Linkage_Group) %>%
summarize(min_index = min(Index),
max_index = max(Index))
## # A tibble: 17 × 3
## Linkage_Group min_index max_index
## <dbl> <dbl> <dbl>
## 1 1 1 663
## 2 2 664 1687
## 3 3 1688 2630
## 4 4 2635 3432
## 5 5 3433 4530
## 6 6 4533 5266
## 7 7 5270 5934
## 8 8 5936 6792
## 9 9 6793 7623
## 10 10 7624 8648
## 11 11 8652 9728
## 12 12 9731 10719
## 13 13 10721 11558
## 14 14 11560 12365
## 15 15 12366 13605
## 16 16 13610 14428
## 17 17 14431 15260
Ok here we can see Index does not repeat, but
Genetic_Distance restarts with each chromosome.
At its core, a Manhattan plot is a scatter plot. The data we are working with has 4701 traits, which here are relative metabolite abundance. We are going to pick one metabolite to start working with.
We will start with the feature that represents chlorogenic acid, a
caffeoyl-quinic acid you find in apples. The column we want is
X353.09194_2.23795. The data is already present as the
-log10 p-value for the relationship between allelic variation at that
marker, and relative abundance of chlorogenic acid.
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point()
See how color is plotted on a continuous scale? This is because
Linkage_Group is a continuous, numeric variable. Since each
chromosome is actually discrete, let’s convert
Linkage_Group to a factor and then plot again.
Linkage_Group as a factorgwas$Linkage_Group <- as.factor(gwas$Linkage_Group)
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point()
Better but this really isn’t what we want. We want our x-axis to
indicate the chromosome number in the middle of the block of that
chromosome, not label by Index which just is a key for
linking back to each specific marker.
If we want to label the x-axis with breaks for each chromosome, we have to do some wrangling first. Just like we did some calculations in the lesson on adding statistics, we will calculate some min, center, and max for each chromosome so we know where to put the labels.
(set_axis <- gwas %>%
group_by(Linkage_Group) %>%
summarize(min = min(Index),
max = max(Index),
center = (max - min)/2))
## # A tibble: 17 × 4
## Linkage_Group min max center
## <fct> <dbl> <dbl> <dbl>
## 1 1 1 663 331
## 2 2 664 1687 512.
## 3 3 1688 2630 471
## 4 4 2635 3432 398.
## 5 5 3433 4530 548.
## 6 6 4533 5266 366.
## 7 7 5270 5934 332
## 8 8 5936 6792 428
## 9 9 6793 7623 415
## 10 10 7624 8648 512
## 11 11 8652 9728 538
## 12 12 9731 10719 494
## 13 13 10721 11558 418.
## 14 14 11560 12365 402.
## 15 15 12366 13605 620.
## 16 16 13610 14428 409
## 17 17 14431 15260 414.
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point() +
scale_x_continuous(breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
theme_classic() +
theme(legend.position = "none") + # legend not really necessary
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = "GWAS of chlorogenic acid in apple")
Having a rainbow of colors is not really necessary here,a nd in fact telling exactly where chromosome 15 ends and 16 begins is difficult because the colors are so similar.
What you will see in a lot of papers is people simply alternate the colors of their points by chromosome so you can easily tell which points belong to which chromosome.
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point() +
scale_x_continuous(breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
The gap between chromosome 1 and the y-axis of the plot sort of bothers me. Let’s remove it.
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point() +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "grey52"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
# what would the pvalue cut off with a bonferroni correction be?
bonferroni_pval <- -log10(0.05/nrow(gwas))
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point() +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype= "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
# select all SNPs with -log10 pvalue > bonferroni cutoff for chlorogenic acid
chlorogenic_acid_sig <- gwas %>%
filter(X353.09194_2.23795 > bonferroni_pval) %>%
select(Index, Linkage_Group, Genetic_Distance, X353.09194_2.23795)
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point() +
geom_point(data = chlorogenic_acid_sig,
aes(x = Index, y = X353.09194_2.23795), color = "red") +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype= "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
We might be interested to know the marker that has the most significant association with chlorogenic acid content, and label it on our plot.
biggest_pval <- chlorogenic_acid_sig %>%
filter(X353.09194_2.23795 == max(X353.09194_2.23795))
gwas %>%
ggplot(aes(x = Index, y = X353.09194_2.23795, color = Linkage_Group)) +
geom_point() +
geom_point(data = chlorogenic_acid_sig,
aes(x = Index, y = X353.09194_2.23795), color = "red") +
geom_label_repel(data = biggest_pval,
aes(x = Index, y = X353.09194_2.23795, label = Index)) +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype= "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
In this study, we conducted a series of GWAS on thousands of metabolomic features in apple. What if we wanted to see Manhattan plots for certain features based on how important we could predict they would be? For example, what if we want to see the Manhattan plot for the feature with biggest -log10p-value? Or the feature that has a significant association with the largest number of markers?
To make this wrangling easier, we will convert our data, as we have
many times before, from wide to long with
pivot_longer().
gwas_tidy <- gwas %>%
pivot_longer(cols = 4:ncol(.),
names_to = "Feature",
values_to = "NegLog10P")
We can make another df that includes only the features that have at least one marker where there is a significant p-value.
# make df of associations that pass bonferroni correction
gwas_tidy_bonferroni <- gwas_tidy %>%
filter(NegLog10P > bonferroni_pval)
# how many unique features are there?
length(unique(gwas_tidy_bonferroni$Feature))
## [1] 963
# how many unique markers are there?
length(unique(gwas_tidy_bonferroni$Index))
## [1] 544
There are 963 unique features/metabolite that have a Bonferroni adjusted significant p-value with at least one marker. There are 544 unique markers that have a Bonferroni adjusted significant p-value with at least one feature/metabolite.
What features are associated with the largest number of markers?
gwas_tidy_bonferroni %>%
group_by(Feature) %>%
count() %>%
arrange(desc(n))
## # A tibble: 963 × 2
## # Groups: Feature [963]
## Feature n
## <chr> <int>
## 1 X417.13237_1.82968 46
## 2 X349.15073_1.79191 44
## 3 X601.13217_2.40546 34
## 4 X593.12835_2.53465 31
## 5 X291.0768_2.44657 30
## 6 X591.1485_2.86273 30
## 7 X637.09169_2.78692 30
## 8 X661.08791_2.10005 30
## 9 X137.02484_2.44808 29
## 10 X561.13983_2.53357 29
## # ℹ 953 more rows
Wow, the marker X417.13237_1.82968 has significant
associations with 46 markers. What would that Manhattan plot look
like?
gwas_tidy %>%
filter(Feature == "X417.13237_1.82968") %>%
ggplot(aes(x = Index, y = NegLog10P, color = Linkage_Group)) +
geom_point() +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype= "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = "Manhattan Plot after GWAS for 417.13237 m/z at retention time 1.82968 in Apple")
What if we want to make Manhattan plots for the 50 features/metabolites that are associated with the most markers? This is probably too many plots to facet, so we can do some calculations, and then write a function to make plots, and apply it over our dataframe.
First, how many significant associations with a Bonferroni multiple testing correction are there?
# make df of associations that pass bonferroni correction
gwas_tidy_bonferroni <- gwas_tidy %>%
filter(NegLog10P > bonferroni_pval)
# how many unique features are this?
gwas_tidy_bonferroni %>%
count(Feature)
## # A tibble: 963 × 2
## Feature n
## <chr> <int>
## 1 X1000.22158_2.71331 12
## 2 X1000.72569_2.72017 2
## 3 X1001.23392_2.70506 5
## 4 X1008.71962_2.91507 3
## 5 X1008.72084_2.64352 2
## 6 X1009.22592_2.91262 5
## 7 X1009.72353_2.64479 2
## 8 X1010.22369_2.64604 2
## 9 X1010.72865_2.64538 1
## 10 X1014.70219_2.12784 2
## # ℹ 953 more rows
# how many unique markers are there?
gwas_tidy_bonferroni %>%
count(Index)
## # A tibble: 544 × 2
## Index n
## <dbl> <int>
## 1 170 1
## 2 217 1
## 3 218 1
## 4 233 2
## 5 294 1
## 6 311 1
## 7 341 2
## 8 368 1
## 9 386 4
## 10 520 6
## # ℹ 534 more rows
Which features are associated with the largest number of markers?
gwas_tidy_bonferroni %>%
count(Feature) %>%
arrange(desc(n))
## # A tibble: 963 × 2
## Feature n
## <chr> <int>
## 1 X417.13237_1.82968 46
## 2 X349.15073_1.79191 44
## 3 X601.13217_2.40546 34
## 4 X593.12835_2.53465 31
## 5 X291.0768_2.44657 30
## 6 X591.1485_2.86273 30
## 7 X637.09169_2.78692 30
## 8 X661.08791_2.10005 30
## 9 X137.02484_2.44808 29
## 10 X561.13983_2.53357 29
## # ℹ 953 more rows
Which markers are associated with the largest number of features?
gwas_tidy_bonferroni %>%
count(Index) %>%
arrange(desc(n))
## # A tibble: 544 × 2
## Index n
## <dbl> <int>
## 1 13684 320
## 2 13685 318
## 3 13681 317
## 4 13715 298
## 5 13657 223
## 6 13660 223
## 7 13675 221
## 8 13630 219
## 9 13623 199
## 10 13617 198
## # ℹ 534 more rows
We will make a new df that includes only the 50 features with the most makers associated with them.
# create a df with only the top 50 features with the most marker associations
top50 <- gwas_tidy_bonferroni %>%
count(Feature) %>%
arrange(desc(n)) %>%
slice_head(n = 50)
# filter the whole dataset to include only these features
gwas_top50 <- gwas_tidy %>%
filter(Feature %in% top50$Feature)
# go from long to wide
gwas_top50_wide <- gwas_top50 %>%
pivot_wider(names_from = Feature, values_from = NegLog10P)
# what are our new dimensions?
dim(gwas_top50_wide)
## [1] 11165 53
head_short(gwas_top50_wide)
## # A tibble: 5 × 5
## Index Linkage_Group Genetic_Distance X599.12186_2.10421 X963.16828_2.48825
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 1 1 0 0.266 0.238
## 2 3 1 0.002 0.0346 0.255
## 3 4 1 0.003 0.120 0.162
## 4 5 1 0.004 0.497 0.349
## 5 6 1 0.005 0.00356 0.133
Here, we are just modifying our plotting code slightly to allow it to
be run across different features. Here, feature_of_interest
is just the name I’ve assigned here, but you could pick whatever name
you want.
# write a function to make your plots across the features of interest
manhattan_plot <- function(feature_of_interest){
gwas_top50_wide %>% # our df with only the top 50
ggplot(aes(x = Index, color = Linkage_Group)) +
geom_point(aes_string(y = as.name(feature_of_interest))) + # note aes_string, and as.name(feature_of_interest)
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype= "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "gray"),17)) +
labs(x = "Chromosome",
y = expression("-log"[10]* "P-Value"),
title = glue("{feature_of_interest}")) + # here we glue the feature name in the title
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
}
Once we have our function written, we can use a function in the
apply() family of functions (here, lapply()
which applies and creates a list). Here I’m just printing the first 10
plots in the list
# use lapply to run your function over the features of interest
# if you don't want your plots to print, you should assign them to something
my_plots <- lapply(names(gwas_top50_wide[,4:53]), # what to apply over
manhattan_plot) # what function to use
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# print the first 10
my_plots[1:10]
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
But you can print them all, save particular ones
usingggsave(), write them all to a pdf, take each one and
edit it more, or do anything else with them.
pdf("top50_plots.pdf") # create a file name for the printed plots to go in
# for loop to print each plot
for(i in 1:50) {
print(my_plots[[i]])
}
dev.off() # closes the plot above
In class, we will practice making Manhattan plots. Click here for the recitation materials.